# Image-Text Retrieval

**Siglip2 So400m Patch16 Naflex** (google) · Apache-2.0 · Text-to-Image, Transformers · 159.81k downloads · 21 likes
SigLIP 2 is an improved model based on the SigLIP pre-training objective, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Base Patch16 Naflex** (google) · Apache-2.0 · Text-to-Image, Transformers · 10.68k downloads · 5 likes
SigLIP 2 is a multilingual vision-language encoder that integrates SigLIP's pretraining objectives and introduces new training schemes, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 So400m Patch16 512** (google) · Apache-2.0 · Text-to-Image, Transformers · 46.46k downloads · 18 likes
SigLIP 2 is a vision-language model based on SigLIP, with improved semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 So400m Patch16 384** (google) · Apache-2.0 · Text-to-Image, Transformers · 7,632 downloads · 2 likes
SigLIP 2 is an improved model based on the SigLIP pre-training objective, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 So400m Patch16 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 2,729 downloads · 0 likes
SigLIP 2 is an improved model based on SigLIP, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Giant Opt Patch16 384** (google) · Apache-2.0 · Text-to-Image, Transformers · 26.12k downloads · 14 likes
SigLIP 2 is an improved model based on the SigLIP pre-training objective, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Giant Opt Patch16 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 3,936 downloads · 1 like
SigLIP 2 is an advanced vision-language model that combines several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Large Patch16 384** (google) · Apache-2.0 · Text-to-Image, Transformers · 6,525 downloads · 2 likes
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Large Patch16 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 10.89k downloads · 3 likes
SigLIP 2 is an improved vision-language model based on SigLIP, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Base Patch16 512** (google) · Apache-2.0 · Text-to-Image, Transformers · 28.01k downloads · 10 likes
SigLIP 2 is a vision-language model that combines several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Base Patch16 384** (google) · Apache-2.0 · Image-to-Text, Transformers · 4,832 downloads · 5 likes
SigLIP 2 is a vision-language model based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction through a unified training approach.

**Siglip2 Base Patch16 256** (google) · Apache-2.0 · Image-to-Text, Transformers · 45.24k downloads · 4 likes
SigLIP 2 is a multilingual vision-language encoder with improved semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Base Patch16 224** (google) · Apache-2.0 · Image-to-Text, Transformers · 44.75k downloads · 38 likes
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Base Patch32 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 9,419 downloads · 4 likes
SigLIP 2 is an improved version of SigLIP, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

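The SigLIP 2 checkpoints listed above are dual encoders, so image-text retrieval reduces to embedding the images and texts and scoring each pair. Below is a minimal sketch using Hugging Face `transformers`, assuming the base 224px variant is published on the Hub as `google/siglip2-base-patch16-224`; the other fixed-resolution checkpoints follow the same pattern, while the NaFlex variants additionally accept variable resolutions and aspect ratios through their processor.

```python
import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed Hub id; swap in another SigLIP 2 checkpoint as needed
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["two cats lying on a couch", "a dog running on a beach"]

# SigLIP expects text padded to a fixed maximum length.
inputs = processor(text=texts, images=image,
                   padding="max_length", max_length=64, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP is trained with a sigmoid loss, so each image-text pair gets an
# independent match probability rather than a softmax over candidates.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)  # shape: (num_images, num_texts)
```
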
**Tic CLIP Basic Oracle** (apple) · Other license · Text-to-Image · 37 downloads · 0 likes
TiC-CLIP is an improved vision-language model based on OpenCLIP, focused on time-continual learning, with training data spanning 2014 to 2022.

**Clip Japanese Base** (line-corporation) · Apache-2.0 · Text-to-Image, Transformers, Japanese · 14.31k downloads · 22 likes
A Japanese CLIP model developed by LY Corporation, trained on approximately 1 billion web-collected image-text pairs, suitable for various vision tasks.

**Japanese Clip Vit B 32 Roberta Base** (recruit-jp) · Text-to-Image, Transformers, Japanese · 384 downloads · 9 likes
A Japanese version of the CLIP model that maps Japanese text and images into the same embedding space, suitable for zero-shot image classification, text-image retrieval, and other tasks.

**Align Base** (kakaobrain) · Multimodal Alignment, Transformers, English · 78.28k downloads · 25 likes
ALIGN is a vision-language dual-encoder model that aligns image and text representations through contrastive learning, achieving state-of-the-art cross-modal representations from large-scale noisy data.

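ALIGN is likewise a dual encoder with native `transformers` support (`AlignModel` / `AlignProcessor`). A minimal retrieval sketch, assuming the kakaobrain release is available on the Hub as `kakaobrain/align-base`:

```python
import torch
import requests
from PIL import Image
from transformers import AlignModel, AlignProcessor

ckpt = "kakaobrain/align-base"  # assumed Hub id
model = AlignModel.from_pretrained(ckpt)
processor = AlignProcessor.from_pretrained(ckpt)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of two cats", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# ALIGN is trained with a contrastive (softmax) loss, so similarities are
# typically normalized across the candidate texts.
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # shape: (num_images, num_texts)
```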